Principle of Analytic Graphics

Exploratory Graphs (examples)

One Dimension Summary of Data

  • summary(data) = returns min, 1st quartile, median, mean, 3rd quartile, max
  • boxplot(data, col = "blue") = produces a box plot with the box (middle 50% of the data) filled in the specified color
    • box = 25th percentile, median, 75th percentile
    • whiskers = extend to the most extreme data points within 1.5 × IQR of the box (default range = 1.5); points beyond that are drawn individually
      • IQR = interquartile range, Q\(_3\) - Q\(_1\)
    • notches (notch = TRUE) = median \(\pm 1.58 \, IQR/\sqrt{n}\)
  • hist(data, col = "green") = produces a histogram with specified breaks and color
    • breaks = 100 = suggested number of bins; the higher the number, the narrower the histogram columns
  • rug(data) = adds a strip under the histogram marking the location of each data point
  • barplot(data, col = "wheat") = produces a bar graph, usually for categorical data (data = bar heights, e.g. table(x) for a categorical variable x)

  • Overlaying Features
  • abline(h = 12) or abline(v = 12) = overlays a horizontal or vertical line at the specified location
    • col = "red" = specifies color
    • lwd = 4 = line width
    • lty = 2 = line type
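
The one-dimensional summaries above can be sketched as follows. This is a minimal example, assuming the built-in airquality data set as a stand-in for the pollution data used in these notes; the reference level of 60 is a made-up value for illustration only.

```r
# Stand-in data (assumption): built-in airquality, not the notes' pollution data
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Min, quartiles, median, mean, max
five <- summary(ozone)
print(five)

# Box plot; whiskers reach the most extreme points within 1.5 * IQR of the box
bp <- boxplot(ozone, col = "blue")
abline(h = 60, col = "red", lwd = 2, lty = 2)  # hypothetical reference level

# Histogram with many narrow bins, a rug underneath, and overlaid lines
h <- hist(ozone, col = "green", breaks = 100)
rug(ozone)
abline(v = median(ozone), col = "magenta", lwd = 4)
abline(v = 60, lwd = 2, lty = 2)

# Bar plot of counts for a categorical variable
month_counts <- table(airquality$Month)
barplot(month_counts, col = "wheat", main = "Observations per month")
```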

Two Dimensional Summaries

  • multiple/overlay 1D plots (using lattice/ggplot2)
  • box plots: boxplot(pm25 ~ region, data = pollution, col = "red")

  • histogram:
    • par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) = sets a 2 row by 1 column layout and the margins
    • hist(subset(pollution, region == "east")$pm25, col = "green") = first histogram
    • hist(subset(pollution, region == "west")$pm25, col = "green") = second histogram

  • scatterplot
    • with(pollution, plot(latitude, pm25, col = region))
    • abline(h = 12, lwd = 2, lty = 2) = plots horizontal dashed line
    • plot(jitter(child, 4)~parent, galton) = spreads out data points at the same position to simulate measurement error/make high frequency more visible

  • multiple scatter plots
    • par(mfrow = c(1, 2), mar = c(5, 4, 2, 1)) = sets a 1 row by 2 column layout and the margins
    • with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West")) = left scatterplot
    • with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East")) = right scatterplot
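
The two-dimensional summaries above follow the same pattern with any data frame that has a grouping factor. A minimal sketch, assuming the built-in mtcars data as a stand-in (cyl plays the role of region, and the reference line at 20 is arbitrary):

```r
# Stand-in data (assumption): mtcars with cyl turned into a factor
cars <- transform(mtcars, cyl = factor(cyl))

# Side-by-side box plots of one variable split by a factor
boxplot(mpg ~ cyl, data = cars, col = "red")

# Two stacked histograms, one per subset
old_par <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(cars, cyl == "4")$mpg, col = "green", main = "4 cylinders", xlab = "mpg")
hist(subset(cars, cyl == "8")$mpg, col = "green", main = "8 cylinders", xlab = "mpg")

# Single scatterplot, points colored by factor level, with a dashed line
par(mfrow = c(1, 1))
with(cars, plot(wt, mpg, col = cyl))
abline(h = 20, lwd = 2, lty = 2)

# Jitter separates points that share the same y value
plot(jitter(mpg, 4) ~ wt, data = cars)

# Two scatterplots side by side
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(cars, cyl == "4"), plot(wt, mpg, main = "4 cylinders"))
with(subset(cars, cyl == "8"), plot(wt, mpg, main = "8 cylinders"))

par(old_par)  # restore the layout and margins saved earlier
```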

Process of Making a Plot/Considerations

  • where will plot be made? screen or file?
  • how will plot be used? viewing on screen/web browser/print/presentation?
  • large amount of data vs few points?
  • need to be able to dynamically resize?
  • plotting system: base, lattice, ggplot2?

Base Plotting

Base Graphics Functions and Parameters

  • arguments
    • pch: plotting symbol (default = open circle)
    • lty: line type (default is solid)
      • 0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash
    • lwd: line width (positive number, default = 1)
    • col: plotting color (number, name string, or hex code; colors() returns a vector of color names)
    • xlab, ylab: x-y label character strings
    • cex: numerical value giving the amount by which plotting text/symbols should be magnified relative to the default
      • cex = 0.15 * variable: plot size as an additional variable
  • par() function = specifies global graphics parameters, affects all plots in an R session (can be overridden)
    • las: orientation of axis labels
    • bg: background color
    • mar: margin size (order = bottom left top right)
    • oma: outer margin size (default = 0 for all sides)
    • mfrow: c(rows, columns) = number of rows and columns of plots on the device (plots are filled row-wise)
    • mfcol: c(rows, columns) = number of rows and columns of plots on the device (plots are filled column-wise)
    • can verify all above parameters by calling par("parameter")
  • plotting functions
    • lines: adds lines to a plot, given a vector of x values and corresponding vector of y values
    • points: adds points to a plot
    • text: add text labels to a plot using specified x,y coordinates
    • title: add annotations to x,y axis labels, title, subtitles, outer margin
    • mtext: add arbitrary text to margins (inner or outer) of plot
    • axis: specify axis ticks
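
A minimal sketch of how these base functions layer onto one plot, assuming the built-in airquality data as a stand-in; the label position and text are arbitrary choices for illustration.

```r
# Stand-in data (assumption): built-in airquality
x <- airquality$Wind
y <- airquality$Ozone
ok <- complete.cases(x, y)          # rows where both values are present

old_par <- par(mar = c(5, 4, 4, 2), las = 1, bg = "white")

# Start with an empty plot, then add each element separately
plot(x, y, type = "n", axes = FALSE, xlab = "Wind (mph)", ylab = "Ozone (ppb)")
points(x[ok], y[ok], pch = 20, col = "steelblue", cex = 0.8)

fit <- lm(y ~ x)                    # lm drops incomplete rows itself
abline(fit, lwd = 2, lty = 2)       # dashed regression line
lines(lowess(x[ok], y[ok]), col = "red", lwd = 2)

text(15, 120, "dashed = linear fit")            # label at x = 15, y = 120
title(main = "Ozone and Wind in New York City")
mtext("base graphics sketch", side = 3, line = 0.3, cex = 0.8)
axis(1)                             # x-axis ticks
axis(2)                             # y-axis ticks
box()
legend("topright", legend = c("linear", "lowess"),
       col = c("black", "red"), lty = c(2, 1), lwd = 2)

par(old_par)                        # restore previous settings
```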

Graphics Device

lattice Plotting System

lattice Functions and Parameters

  • Functions
    • xyplot() = main function for creating scatterplots
    • bwplot() = box and whiskers plots (box plots)
    • histogram() = histograms
    • stripplot() = box plot with actual points
    • dotplot() = plot dots on “violin strings”
    • splom() = scatterplot matrix (like pairs() in base plotting system)
    • levelplot()/contourplot() = plotting image data
  • Arguments for xyplot(y ~ x | f * g, data, layout, panel)
    • default blue open circles for data points
    • formula notation is used here (~) = left hand side is the y-axis variable, and the right hand side is the x-axis variable
    • f/g = conditioning/categorical variables (optional)
      • basically creates multi-panelled plots (for different factor levels)
      • * indicates interaction between two variables
      • intuitively, the xyplot displays a graph between x and y for every level of f and g
    • data = the data frame/list from which the variables should be looked up
      • if nothing is passed, the parent frame is used (searching for variables in the workspace)
      • if no other arguments are passed, defaults will be used
    • layout = specifies how the different plots will appear
      • layout = c(5, 1) = c(columns, rows), here 5 subplots side by side in a single row
      • padding/spacing/margin automatically set
    • [optional] panel function can be added to control what is plotted inside each panel of the plot
      • panel functions receive x/y coordinates of the data points in their panel (along with any additional arguments)
      • ?panel.xyplot = brings up documentation for the panel functions
      • Note: no base plot functions can be used for lattice plots
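
A minimal sketch of xyplot() with a conditioning factor, a layout, and a custom panel function, assuming the built-in mtcars data as a stand-in. The panel function here is only one possible choice: it draws the default points, then a dashed line at each panel's median and a per-panel regression line.

```r
library(lattice)

# Stand-in data (assumption): mtcars with a labelled conditioning factor
cars <- transform(mtcars,
                  cyl = factor(cyl, labels = c("4 cyl", "6 cyl", "8 cyl")))

p <- xyplot(mpg ~ wt | cyl, data = cars, layout = c(3, 1),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)               # default points
              panel.abline(h = median(y), lty = 2)  # median of this panel's y
              panel.lmline(x, y, col = 2)           # regression line per panel
            })

# lattice functions return a "trellis" object; printing it draws the plot
print(p)
```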

ggplot2 Plotting System

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”

ggplot2 Functions and Parameters

  • basic components of a ggplot2 graphic
    • data frame = source of data
    • aesthetic mappings = how data are mapped to color/size (x vs y)
    • geoms = geometric objects like points/lines/shapes to put on page
    • facets = conditional plots using factor variables/multiple panels
    • stats = statistical transformations like binning/quantiles/smoothing
    • scales = what scale an aesthetic map uses (e.g. male = red, female = blue)
    • coordinate system = system in which data are plotted
  • qplot(x, y, data, color, geom) = quick plot, analogous to base system’s plot() function
    • default style: gray background, white gridlines, x and y labels automatic, and solid black circles for data points
    • data always comes from a data frame (if unspecified, the function will look for the variables in the workspace)
    • plots are made up of aesthetics (size, shape, color) and geoms (points, lines)
    • Note: capable of producing quick graphics, but difficult to customize in detail
  • factor variables: important for graphing subsets of data = they should be labelled with specific information, and not just 1, 2, 3
    • color = factor1 = use the factor variable to display subsets of data in different colors on the same plot (legend automatically generated)
    • shape = factor2 = use the factor variable to display subsets of data in different shapes on the same plot (legend automatically generated)
    • example
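
A minimal sketch of mapping factor variables to color and shape, assuming the mpg data set that ships with ggplot2 as a stand-in. Recent ggplot2 releases mark qplot() as deprecated, so it may print a deprecation warning while still drawing the plot.

```r
library(ggplot2)

# drv (drive type) is used for both color and shape; legends are automatic
p_color <- qplot(displ, hwy, data = mpg, color = drv, shape = drv)
print(p_color)
```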

  • adding statistics: geom = c("point", "smooth") = add a smoother (loess, pronounced “low S”)
    • "point" plots the data themselves, "smooth" plots a smooth mean line in blue with the 95% confidence interval shaded in dark gray
    • method = "lm" = additional argument method can be specified to create different lines/confidence intervals
      • lm = linear regression
    • example
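
A minimal sketch of adding a smoother, assuming the mpg data set from ggplot2 as a stand-in. Passing method = "lm" through qplot() can also hand the argument to the point layer, which some ggplot2 versions report as an ignored parameter, so the linear fit is written with ggplot() here.

```r
library(ggplot2)

# Points plus the default smoother and its confidence band
p_loess <- qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
print(p_loess)

# Same data with a linear regression line instead
p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
print(p_lm)
```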

  • histograms: if only one variable is specified, a histogram is produced
    • fill = factor1 = can be used to fill the histogram with different colors for the subsets (legend automatically generated)
    • example
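
A minimal sketch of a filled histogram, assuming the mpg data set from ggplot2 as a stand-in; the bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

# Only x is given, so qplot() draws a histogram; fill splits the bars by drv
p_hist <- qplot(hwy, data = mpg, fill = drv, binwidth = 2)
print(p_hist)
```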

  • facets: similar to panels in lattice, split data according to factor variables
    • facets = rows ~ columns = produce different subplots by factor variables specified (rows/columns)
    • "." indicates there are no additional rows or columns
    • facets = . ~ columns = creates 1 by col subplots
    • facets = row ~ . = creates row by 1 subplots
    • labels get generated automatically based on factor variable values
    • example
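
A minimal sketch of both facet orientations, assuming the mpg data set from ggplot2 as a stand-in:

```r
library(ggplot2)

# One row of panels, one column per level of drv
p_cols <- qplot(displ, hwy, data = mpg, facets = . ~ drv)
print(p_cols)

# One column of panels, one row per level of drv
p_rows <- qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
print(p_rows)
```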

  • density smooth: smooths the histograms into a line tracing its shape
    • geom = "density" = replaces the default scatterplot with density smooth curve
    • example
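
A minimal sketch of a density smooth, assuming the mpg data set from ggplot2 as a stand-in; color = drv draws one curve per drive type.

```r
library(ggplot2)

# Density curves of hwy, one per level of drv
p_dens <- qplot(hwy, data = mpg, geom = "density", color = drv)
print(p_dens)
```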

  • ggplot()
    • built up in layers/modularly (similar to base plotting system)
      • data \(\rightarrow\) overlay summary \(\rightarrow\) metadata/annotation
    • g <- ggplot(data, aes(var1, var2))
      • initiates call to ggplot and specifies the data frame that will be used
      • aes(var1, var2) = specifies aesthetic mapping, or var1 = x variable, and var2 = y variable
      • summary(g) = displays summary of ggplot object
      • print(g) = no layer has been added yet, so the plot does not know how to draw the data (older versions of ggplot2 return the error “no layer on plot”; newer versions draw an empty panel)
    • g + geom_point() = takes information from g object and produces scatter plot
    • + geom_smooth() = adds a smooth mean curve (loess, “low S”) with confidence interval
      • method = "lm" = changes the smooth curve to be linear regression
      • size = 4, linetype = 3 = can be specified to change the size/style of the line (ggplot2 3.4.0 and later use linewidth instead of size for lines)
      • se = FALSE = turns off confidence interval
    • + facet_grid(row ~ col) = splits data into subplots by factor variables (see facets from qplot())
      • conditioning on continuous variables is possible through cutting/making a new categorical variable
      • cutPts <- quantile(df$cVar, seq(0, 1, length = 4), na.rm = TRUE) = creates quantiles where the continuous variable will be cut
        • seq(0, 1, length=4) = creates 4 quantile points
        • na.rm = TRUE = removes all NA values
      • df$newFactor <- cut(df$cVar, cutPts) = creates new categorical/factor variable by using the cutpoints (add include.lowest = TRUE so the minimum value is not turned into NA)
        • creates n-1 ranges from n points = in this case 3
    • annotations:
      • xlab(), ylab(), labs(), ggtitle() = for labels and titles
        • labs(x = expression("log " * PM[2.5]), y = "Nocturnal") = specifies x and y labels
        • expression() = used to produce mathematical expressions
      • geom functions = many options to modify
      • theme() = for global changes in presentation
        • example: theme(legend.position = "none")
      • two standard themes defined: theme_gray() and theme_bw()
      • theme_bw(base_family = "Times") = base_family argument of the theme functions changes the font to Times
    • aesthetics
      • + geom_point(color, size, alpha) = specifies how the points are supposed to be plotted on the graph (style)
        • Note: this translates to geom_line()/other forms of plots
        • color = "steelblue" = specifies color of the data points
        • aes(color = var1) = wrapping color argument this way allows a factor variable to be assigned to the data points, thus subsetting it with different colors based on factor variable values
        • size = 4 = specifies size of the data points
        • alpha = 0.5 = specifies transparency of the data points
      • example
    Alpha Level

    • axis limits
      • + ylim(-3, 3) = limits the range of y variable to a specific range
        • Note: ggplot will exclude (not plot) points that fall outside of this range (outliers), potentially leaving gaps in plot
      • + coord_cartesian(ylim = c(-3, 3)) = this will limit the visible range but plot all points of the data
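
The layered ggplot() workflow above can be sketched end to end. This is a minimal example, assuming the mpg data set from ggplot2 as a stand-in; the column name ctyGroup, the axis limits, and the "serif" font family are choices made for this illustration.

```r
library(ggplot2)

df <- mpg  # stand-in data frame (assumption)

# Cut a continuous variable into 3 groups at its tertiles
cutPts <- quantile(df$cty, seq(0, 1, length.out = 4), na.rm = TRUE)
df$ctyGroup <- cut(df$cty, cutPts, include.lowest = TRUE)  # keep the minimum

# Data and aesthetic mapping first, then one layer at a time
g <- ggplot(df, aes(displ, hwy)) +
  geom_point(aes(color = drv), size = 2, alpha = 0.5) +   # color by factor
  geom_smooth(method = "lm", se = FALSE, linetype = 3) +  # dotted linear fit
  facet_grid(. ~ ctyGroup) +                              # one panel per group
  labs(title = "Highway mileage by displacement",
       x = "Engine displacement (L)", y = "Highway mpg") +
  theme_bw(base_family = "serif") +
  coord_cartesian(ylim = c(10, 40))  # zooms without dropping any points

print(g)
```

Swapping coord_cartesian(ylim = c(10, 40)) for ylim(10, 40) would remove points outside the range before the smoother is fitted, which can change the fitted line.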

Color Packages in R Plots

grDevices Package

  • colors() function = lists names of colors available in any plotting function
  • colorRamp function
    • takes any set of colors and returns a function that takes values between 0 and 1, indicating the extremes of the color palette (e.g. see the gray function)
    • pal <- colorRamp(c("red", "blue")) = defines a colorRamp function
    • pal(0) returns a 1 x 3 matrix containing RED, GREEN, and BLUE values that range from 0 to 255
    • pal(seq(0, 1, len = 10)) returns a 10 x 3 matrix of 10 colors that range from RED to BLUE (two ends of spectrum defined in the object)
    • example
##       [,1] [,2]   [,3]
## [1,] 84.15    0 170.85
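
A minimal sketch of colorRamp. The matrix printed above is consistent with evaluating a red-to-blue ramp at 0.67 (255 × 0.33 = 84.15 red, 255 × 0.67 = 170.85 blue), so that value is assumed here.

```r
# Ramp between red (0) and blue (1); results are RGB values on a 0-255 scale
pal <- colorRamp(c("red", "blue"))

start <- pal(0)      # pure red
end   <- pal(1)      # pure blue
mid   <- pal(0.67)   # mostly blue with some red
print(mid)

# Ten evenly spaced colors between the two ends: a 10 x 3 matrix
ten <- pal(seq(0, 1, length.out = 10))
print(dim(ten))
```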
  • colorRampPalette function
    • takes any set of colors and returns a function that takes integer arguments and returns a vector of colors interpolating the palette (like heat.colors or topo.colors)
    • pal <- colorRampPalette(c("red", "yellow")) defines a colorRampPalette function
    • pal(10) returns 10 interpolated colors in hexadecimal format that range between the defined ends of spectrum
    • example
##  [1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100" "#FF8D00" "#FFAA00"
##  [8] "#FFC600" "#FFE200" "#FFFF00"
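
A minimal sketch of colorRampPalette using the same red-to-yellow ends; the volcano matrix is a built-in data set used here only to show the palette in a plot.

```r
# Palette function that interpolates between red and yellow
pal <- colorRampPalette(c("red", "yellow"))

two <- pal(2)    # just the two end colors
ten <- pal(10)   # ten hex colors from red to yellow
print(ten)

# The result can be passed to any col argument
image(volcano, col = pal(20))
```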
  • rgb function
    • red, green, and blue arguments = values between 0 and 1
    • alpha = 0.5 = transparency control, values between 0 and 1
    • returns hexadecimal string for color that can be used in plot/image commands
    • colorspace package can be used for different control over colors
    • example
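
A minimal sketch of rgb with and without transparency, using simulated points (an assumption for illustration) to show how alpha helps with overplotting.

```r
solid       <- rgb(1, 0, 0)               # opaque red
translucent <- rgb(0, 0, 1, alpha = 0.5)  # half-transparent blue
print(c(solid, translucent))

# Semi-transparent black makes dense regions appear darker
set.seed(1)
x <- rnorm(2000)
y <- rnorm(2000)
plot(x, y, pch = 19, col = rgb(0, 0, 0, 0.2))
```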

RColorBrewer Package

  • available on CRAN; provides predefined color palettes
    • library(RColorBrewer)
  • types of palettes
    • Sequential = numerical/continuous data that is ordered from low to high
    • Diverging = data that deviate from a value, increasing in two directions (i.e. standard deviations from the mean)
    • Qualitative = categorical data/factor variables
  • palette information from the RColorBrewer package can be used by colorRamp and colorRampPalette functions
  • available color palettes: display.brewer.all() draws every palette in the package

  • brewer.pal(n, "BuGn") function
    • n = number of colors to generate
    • "BuGn" = name of palette
      • ?brewer.pal lists all available palettes to use
    • returns a character vector of n hexadecimal colors
  • example
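
A minimal sketch of combining brewer.pal with colorRampPalette, assuming the RColorBrewer package is installed; the volcano matrix is a built-in data set used only as something to color.

```r
library(RColorBrewer)

# Three colors from the sequential blue-green palette
cols <- brewer.pal(3, "BuGn")
print(cols)

# Interpolate those three colors into a longer palette
pal <- colorRampPalette(cols)
image(volcano, col = pal(20))
```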

  • smoothScatter function
    • used to plot large quantities of data points
    • creates 2D histogram of points and plots the histogram
    • default color scheme = white-to-blue ramp built from blues9 in grDevices (similar to the “Blues” palette from RColorBrewer)
    • example
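
A minimal sketch of smoothScatter on simulated data (an assumption for illustration). The second call swaps the color ramp through the colramp argument; the white-to-green ramp is an arbitrary choice.

```r
set.seed(1)
x <- rnorm(10000)
y <- rnorm(10000)

# Smoothed 2D density of the points instead of 10,000 overlapping symbols
smoothScatter(x, y)

# Same plot with a custom color ramp
smoothScatter(x, y, colramp = colorRampPalette(c("white", "darkgreen")))
```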

Case Study: Human Activity Tracking with Smart Phones

Loading Training Set of Samsung S2 Data from UCI Repository

## 
##   laying  sitting standing     walk walkdown   walkup 
##     1407     1286     1374     1226      986     1073

Plotting Average Acceleration for First Subject

Plotting Max Acceleration for the First Subject

Case Study: Fine Particle Pollution in the U.S. from 1999 to 2012

Read Raw Data from 1999 and 2012

Summaries for Both Periods

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -10.00    4.00    7.63    9.14   12.00  908.97   73133
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.20   11.50   13.74   17.90  157.10   13217
##     NA.1990    NA.2012
## 1 0.1125608 0.05607125

Make a boxplot of both 1999 and 2012

Check for Negative Values in ‘x1’

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -10.00    4.00    7.63    9.14   12.00  908.97   73133
## [1] 26474
## [1] 0.0215034

Check Same New York Monitors at 1999 and 2012

##  [1] "1.5"     "1.12"    "5.80"    "13.11"   "29.5"    "31.3"    "63.2008"
##  [8] "67.1015" "85.55"   "101.3"

Find how many observations available at each monitor

##    1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015 
##      61     122     152      61      61     183      61     122     122 
##   85.55 
##       7
##    1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015 
##      31      64      31      31      33      15      31      30      31 
##   85.55 
##      31

Choose Monitor where County = 63 and Site ID = 2008

## [1] 30 29
## [1] 122  29

Plot Data for 2012

Plot data for 1999

Panel Plot for Both Years

Find State-wide Means and Trend

## [1] 52  3
##   state    mean.x    mean.y
## 1     1 19.956391 10.126190
## 2    10 14.492895 11.236059
## 3    11 15.786507 11.991697
## 4    12 11.137139  8.239690
## 5    13 19.943240 11.321364
## 6    15  4.861821  8.749336